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Where: 

k = ½ ρ C_D A

ρ = atmospheric density
C_D = drag coefficient
A = frontal area
=  burn  time 
P  =  average  thrust 
m  =  average  thrusting  mass 
mb  =  burnout  mass 
g  =  acceleration  due  to  gravity 
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The  computer  design  market 
(in  millions  produced,  not  value) 


Device Type           2001   2002   2003   2004   2005   2006
Computational MPUs     147    151    168    187    202    218
Embedded MPUs          159    152    161    167    177    186
4- and 8-bit MCUs     3982   4180   4570   5090   5540   5970
16-bit MCUs            589    626    678    714    797    797
32-bit MCUs            190    216    247    368    511    673
DSP                    484    526    569    638    694    691
Total                 5551   5851   6393   7164   7921   8535

REF: J. Hines, “2003 Semiconductor Manufacturing Market: Wafer Foundry.” Gartner Focus Report, August 2003.
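
As a rough reading of the market table, the compound annual growth rate of each device type can be computed from its 2001 and 2006 unit counts. The sketch below is illustrative only: the helper name `cagr` and the dictionary layout are assumptions made here, while the unit counts (in millions) are copied from the table.

```python
# Unit shipments in millions, taken from the 2001 and 2006 columns of the table.
SHIPMENTS = {
    "Computational MPUs": (147, 218),
    "Embedded MPUs": (159, 186),
    "4- and 8-bit MCUs": (3982, 5970),
    "16-bit MCUs": (589, 797),
    "32-bit MCUs": (190, 673),
    "DSP": (484, 691),
    "Total": (5551, 8535),
}


def cagr(first: float, last: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (last / first) ** (1.0 / years) - 1.0


if __name__ == "__main__":
    # 2001 -> 2006 spans five yearly steps.
    for device, (units_2001, units_2006) in SHIPMENTS.items():
        print(f"{device:20s} {cagr(units_2001, units_2006, 5):6.1%}")
```

On these figures the 32-bit MCU line grows fastest, at roughly 29% per year, against roughly 9% per year for the total.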
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The  computer  design  market  growth 


MPU  Marketplace  Growth 


B.  Lewis,  “Microprocessor  Cores  Shape  Future  SLI/SOC  Market.”  Gartner,  July  2002 
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Moore’s  Law 


•  Something  doubles  (or  is  announced  to 
double)  right  before  a  stockholders  meeting. 

•  So  that  it  really  doubles  about  every  2  years 


Suggested  by:  Greg  Astfalk  lecture 
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Semiconductor  Industry  Roadmap 


Semiconductor Technology Roadmap

Year                                 2004   2007   2013   2016
(f) Technology generation (nm)         90     65     32     22
Wafer size (cm)                        30     30     45     45
(ρ) Defect density (per cm²)         0.14   0.14   0.14   0.14
(A) µP die size (cm²)                 3.1    3.1    3.1    3.1
Chip frequency (GHz)                  4.2    9.3     23   39.6
MTx per chip (microprocessor)         553   1204   4424   8848
MaxPwr (W), high performance          158    189    251    288
Time (Performance), Area and Power Tradeoffs
There’s a lot of silicon out there ... and it’s cheap


www.newburycamraclub.org.uk 
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Cost  of  Silicon 


REF: J. Hines, “2003 Semiconductor Manufacturing Market: Wafer Foundry.” Gartner Focus Report, August 2003.
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The  basics  of  wafer  fab 


A 30 cm, state-of-the-art (ρ = 0.2) wafer fab
facility might cost $3B and require
$5B/year in sales to be profitable... that’s at
least 5M wafer starts and almost 5B 1 cm²
die per year. A die (f = 90 nm) has 100-200
Mtx/cm².

Ultimately, at O($2000) per wafer, that’s $2/cm²,
or $2 per 100 Mtx. So how to use them?
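
The cost figures on this slide are order-of-magnitude estimates, and the arithmetic behind them can be sketched in a few lines. The function names below are assumptions made for illustration, and the calculation uses gross wafer area only (no edge loss, no yield), so it is not the author's exact method.

```python
import math


def wafer_area_cm2(diameter_cm: float) -> float:
    """Gross area of a circular wafer, ignoring edge loss and yield."""
    return math.pi * (diameter_cm / 2.0) ** 2


def cost_per_cm2(wafer_cost_usd: float, diameter_cm: float) -> float:
    """Processed-silicon cost per square centimetre of gross wafer area."""
    return wafer_cost_usd / wafer_area_cm2(diameter_cm)


def cost_per_mtx(wafer_cost_usd: float, diameter_cm: float,
                 mtx_per_cm2: float) -> float:
    """Cost per million transistors at a given transistor density."""
    return cost_per_cm2(wafer_cost_usd, diameter_cm) / mtx_per_cm2


if __name__ == "__main__":
    # Slide values: 30 cm wafer, about $2000 per wafer, 100 Mtx/cm^2.
    print(f"area     : {wafer_area_cm2(30):.0f} cm^2")
    print(f"cost/cm^2: ${cost_per_cm2(2000, 30):.2f}")
    print(f"cost/Mtx : ${cost_per_mtx(2000, 30, 100):.4f}")
```

A 30 cm wafer has roughly 707 cm² of gross area, so $2000 per wafer works out to a little under $3/cm², the same order as the $2/cm² quoted on the slide.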
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Area  and  Cost 


Is  efficient  use  of  die  area  important? 

Is  processor  cost  important? 

NO, to a point - server processors (cost
dominated by memory, power/cooling,
etc.)

YES - everything in the middle.

NO - very small, embedded die which are
package limited. (10-100 Mtx/die)
But it takes a lot of effort to design a chip


world’s first CAD system?


©  Canadian  Museum  of 
Civilization  Corporation 
Design time: CAD productivity
limitations favor soft designs

Figure: logic transistors per chip (K) and design productivity (transistors per staff-month), plotted against time.
Complexity growth rate: 58%/yr. compound
Productivity growth rate: 21%/yr. compound

Source: S. Malik, orig. SEMATECH
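
The two compound rates quoted on the chart (58% per year for complexity, 21% per year for productivity) imply a design-effort gap that itself compounds. The sketch below is an illustration under that reading; the function name `design_gap` and the choice of a 10-year horizon are assumptions, not figures from the talk.

```python
def design_gap(years: int,
               complexity_growth: float = 0.58,
               productivity_growth: float = 0.21) -> float:
    """Factor by which staff-months per chip grow after `years`.

    Effort per chip is transistors per chip divided by transistors per
    staff-month, so it grows by the ratio of the two yearly factors.
    """
    yearly_ratio = (1.0 + complexity_growth) / (1.0 + productivity_growth)
    return yearly_ratio ** years


if __name__ == "__main__":
    for horizon in (1, 5, 10):
        print(f"after {horizon:2d} years: {design_gap(horizon):5.1f}x the design effort")
```

At these rates the effort needed per chip grows by about 30% per year, or roughly 14x over a decade, which is the pressure toward reusable soft designs that the slide title refers to.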


www.snowder.com 
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High  Speed  Clocking 


Fast  clocks  are  not  primarily  the  result  of  technology 
scaling,  but  rather  of  architecture/logic  techniques: 

-  Smaller  pipeline  segments,  less  clock  overhead 

•  Modern microprocessors are increasing clock speed
more rapidly than anyone (SIA) predicted... fast
clocks or hyperclocking (really short pipe segments)

•  But  fast  clocks  do  not  by  themselves  increase  system 
performance 
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Change  in  pipe  segment  size 


Figure: pipeline segment size in FO4 gate delays, by year (1996, 1999, 2003, 2006).


M.S. Hrishikesh et al., “The Optimal Logic Depth Per Pipeline Stage is 6 to 8 FO4 Inverter Delays.” 29th ISCA, 2002.
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CMOS  processor  clock  frequency 


MHz History Chart: see Mac Info Home Page.
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Bipolar  and  CMOS  clock  frequency 


Figure: clock frequency by year for bipolar and CMOS processors, with a limit curve.
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Bipolar  cooling  technology  (ca  ’91) 


Hitachi  M880:  500  MHz;  one  processor/module,  40  die 
sealed  in  helium  then  cooled  by  a  water  jacket. 
Power  consumed:  about  800  watts  per  module 


F. Kobayashi et al., “Hardware technology for Hitachi M-880.” Proceedings Electronic Components and Tech Conf., 1991.
Translating time into performance: scaling the walls

The  memory  “wall”:  performance  is  limited  by  the 
predictability  &  supply  of  data  from  memory.  This 
depends  on  the  access  time  to  memory  and  thus  on 
wire  delay  which  remains  constant  with  scaling. 

But (so far) significant hardware support (area, fast
dynamic logic) has enabled designers to manage
cache misses, branches, etc., reducing memory access.

There’s also a frequency (minimum segment size)
and related power “wall”. Here the question is how
to improve performance without changing the
segment size or increasing the frequency.
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Missing  the  memory  wall 


Cache  size  and  performance 
for  the  Itanium  processor  family 


•  Itanium  processors  now 
have  9MB  of  cache/die, 
moving  to  >  27  MB 

•  A  typical  processor  (in 
2004)  occupies  <  20%  of 
die,  moving  to  5%. 

•  Limited memory BW,
access time (cache miss
now over 300 cycles)

•  Result:  large  cache  and 
efforts  to  improve  memory 
&  bus 


S. Rusu et al., “Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache.” IEEE Micro, March/April 2004.
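
To see why a miss cost of about 300 (read here as processor cycles) pushes designers toward very large caches, consider the usual first-order model of cycles per instruction. This is a minimal sketch: the function name, the base CPI of 1.0 and the miss rates are hypothetical values chosen for illustration, and only the 300-cycle penalty comes from the slide.

```python
def effective_cpi(base_cpi: float,
                  misses_per_instruction: float,
                  miss_penalty_cycles: float = 300.0) -> float:
    """First-order CPI: base cost plus average stall cycles from cache misses."""
    return base_cpi + misses_per_instruction * miss_penalty_cycles


if __name__ == "__main__":
    # Hypothetical miss rates (misses to memory per instruction).
    for rate in (0.0, 0.001, 0.005):
        print(f"miss rate {rate:.3f}: CPI = {effective_cpi(1.0, rate):.2f}")
```

Under these assumed numbers, one miss per 200 instructions already more than doubles the CPI, so even a small reduction in miss rate from a larger cache is worth a great deal of die area.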
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Power 


Old  rough  an’  ready,  Zachary  Taylor 
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Power:  the  real  price  of  performance 


Power = C · V² · freq + I_leakage · V + I_sc · V


While Vdd and C (capacitance) decrease, frequency increases at
an accelerating rate, thereby increasing power density.

As Vdd decreases so does Vth; this increases I_leak and static power.
Static power is now a big problem in high performance designs.

Net:  while  increasing  frequency  may  or  may  not  increase 
performance  -  it  certainly  does  increase  power 
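
The contributions to power can be compared numerically. The sketch below assumes the common first-order form: switching power C·V²·f, plus leakage current times Vdd, plus short-circuit current times Vdd. The function name and every parameter value are hypothetical, and C is taken to be the effective switched capacitance (activity factor folded in).

```python
def total_power(c_farads: float, vdd: float, freq_hz: float,
                i_leak_amps: float = 0.0, i_sc_amps: float = 0.0) -> float:
    """Total power in watts under a first-order CMOS power model."""
    dynamic = c_farads * vdd ** 2 * freq_hz   # switching power
    static = i_leak_amps * vdd                # leakage (static) power
    short_circuit = i_sc_amps * vdd           # short-circuit power
    return dynamic + static + short_circuit


if __name__ == "__main__":
    # Hypothetical high-performance part: 20 nF switched, 1.2 V, 3 GHz, 10 A leakage.
    base = total_power(20e-9, 1.2, 3e9, i_leak_amps=10.0)
    faster = total_power(20e-9, 1.2, 6e9, i_leak_amps=10.0)
    print(f"3 GHz: {base:.1f} W, 6 GHz: {faster:.1f} W")
```

With these assumed values, doubling the frequency at fixed voltage doubles the switching term while the leakage term stays put, which is the sense in which frequency "certainly does increase power".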


M.  J.  Flynn 


23 


HPEC  ‘04 


Power 


Cooled high power: >70 W/die

High power: 10-50 W/die ... plug-in supply

Low power: 0.1-2 W/die ... rechargeable battery

Very low power: 1-100 mW/die ... AA size
batteries

Extremely low power: 1-100 µW/die and
below (nW) ... button batteries

No power: extract from local EM field,

.... O(1 µW/die)


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Achieving  low  power  systems 

By  Vdd  and  device  scaling 

•  By scaling alone, a 1000x slower implementation
may need only 10⁻⁹ as much power.

•  Gating power to functional units and other
techniques should enable 100 MHz processors to
operate at O(10⁻³) watts.

•  Goal: O(10⁻⁶) watts... Implies about 10 MHz



Extremely Low Power Architecture
(ELPA), O(µW), getting there...


1)  Scaling  alone  lowers  power  by  reducing  parasitic  wire  and 
gate  capacitance. 

2)  Lowering frequency lowers power as the cube of the frequency reduction

3)  Slower clocked processors are better matched to memory,
reducing caches, etc.

4)  Asynchronous  clocking  and  logic  eliminate  clock  power  and 
state  transitions. 

5)  Circuits:  adiabatic  switching,  adaptive  body  biasing,  etc 

6)  Integrated  power  management 
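
The cubic relation in item 2 can be motivated with a one-line argument. This is a sketch under the usual first-order assumption, not stated explicitly on the slides, that the achievable clock frequency scales roughly linearly with supply voltage when Vdd is well above threshold:

```latex
P_{\mathrm{dyn}} \propto C\,V_{dd}^{2}\,f,
\qquad f \propto V_{dd}
\;\Longrightarrow\;
P_{\mathrm{dyn}} \propto f^{3},
\qquad\text{so}\qquad
\frac{P(f/1000)}{P(f)} = \left(10^{-3}\right)^{3} = 10^{-9}.
```

The relation weakens as Vdd approaches Vth, where frequency falls off faster than voltage and leakage takes a larger share of the total.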


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ELP  Technology  &  Architecture 

ELPA Challenge: build a processor &
memory that uses 10⁻⁶ the power of a high
performance MPU and has 0.1 of the state
of the art SPECmark (ok, how about .01?).
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Study  1  StrongARM:  450  mW  @  160MHz 


•  Scale (@22nm): 450 mW becomes 5 mW

•  Cubic rule: 5 mW becomes 5 µW @ 16 MHz

•  Use 10x more area to recover performance

•  10x becomes 3-4x memory match timing.

•  Asynchronous logic may give 5x in perf.

•  Use subthreshold (?) circuits; higher VT

•  Power  management 
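
The cubic-rule step in this study can be checked with a short calculation. The function name below is an assumption made for illustration; the 5 mW starting point (after scaling to 22 nm) and the 160 MHz and 16 MHz clock rates are the slide's own figures.

```python
def cubic_rule_power(power_w: float, freq_hz: float, new_freq_hz: float) -> float:
    """Scale power with the cube of the frequency ratio.

    Assumes supply voltage is lowered along with frequency, so that
    power varies as frequency cubed.
    """
    return power_w * (new_freq_hz / freq_hz) ** 3


if __name__ == "__main__":
    scaled_power_w = 5e-3  # StrongARM after scaling to 22 nm, per the slide
    slow_power_w = cubic_rule_power(scaled_power_w, 160e6, 16e6)
    print(f"{slow_power_w * 1e6:.1f} microwatts at 16 MHz")
```

A 10x lower clock gives a 1000x lower power, which takes 5 mW down to 5 µW and puts the design in the O(µW) range the ELPA challenge asks for, at the cost of performance that the remaining bullets try to recover.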


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Study  2  the  Ant 


•  10 Hz, 0.1-1 µW, 250k neurons; 3D
packaging. About ¼ mm³, or about 1
micron³ per neuron

•  How  many  SPECmarks? 


And while we’re at it, why not an EHP challenge?

Much  more  work  done  in  this  area  but 
limited  applicability 

-  Whole  earth  simulator  (60TFlops) 

-  Grape  series  (40TFlops) 

General  solutions  come  slowly  (like  the 
mills  of  the  Gods) 

-  Streaming  compilers 

-  New  mathematical  models 
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And  don’t  forget  reliability! 


NASA  Langley  Research  Center 
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Computational  Integrity 

Design  for 

-  Reliability 

-  Testability 

-  Serviceability 

-  Process  recoverability 

-  Fail-safe  computation 
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Summary 

•  Embedded  processors/SOC  is  the  major  growth 
area  with  obvious  challenges: 

1)  10x speed without increasing power

2)  10⁻⁶ the power at the same speed

3)  100x circuit complexity with the same design effort.

•  Beyond  this  the  real  challenge  is  in  system 
design:  new  system  concepts,  interconnect 
technologies,  IP  management  and  design  tools. 
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